Adaptive Partitioning for Very Large RDF Data

نویسندگان

Razen Al-Harbi

Ibrahim Abdelaziz

Panos Kalnis

Nikos Mamoulis

Yasser Ebrahim

Majed Sahli

چکیده

State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics aiming at minimizing inter-node communication during query evaluation. This requires an expensive data pre-processing phase, leading to high startup costs for very large RDF knowledge bases. Apriori knowledge of the query workload has also been used to create partitions, which however are static and do not adapt to workload changes; as a result, inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning. In this paper, we propose AdHash, a distributed RDF system, which addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning on the initial data, that distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. R. Harbi · I. Abdelaziz · P. Kalnis · M. Sahli King Abdullah University of Science & Technology, Thuwal, Saudi Arabia E-mail: {first}.{last}@kaust.edu.sa N. Mamoulis University of Ioannina, Greece E-mail: [email protected] Y. Ebrahim Microsoft Corporation, Redmond, WA 98052, United States E-mail: [email protected] As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication

The Resource Description Framework (RDF) and SPARQL query language are gaining wide popularity and acceptance. In this paper, we present DREAM, a distributed and adaptive RDF system. As opposed to existing RDF systems, DREAM avoids partitioning RDF datasets and partitions only SPARQL queries. By not partitioning datasets, DREAM offers a general paradigm for different types of pattern matching q...

متن کامل

Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning

Massive volumes of big RDF data are growing beyond the performance capacity of conventional RDF data management systems operating on a single node. Applications using large RDF data demand efficient data partitioning solutions for supporting RDF data access on a cluster of compute nodes. In this paper we present a novel semantic hash partitioning approach and implement a Semantic HAsh Partition...

متن کامل

Efficient SPARQL Query Evaluation via Automatic Data Partitioning

The volume of RDF data increases very fast within the last five years, e.g. the Linked Open Data cloud grows from 2 billions to 50 billions of RDF triples. With its wonderful scalability, cloud computing platform like Hadoop is a good choice for processing queries over large data sets. Previous works on evaluating SPARQL queries with Hadoop mainly focus on reducing the number of joins through c...

متن کامل

PHD-Store: An Adaptive SPARQL Engine with Dynamic Partitioning for Distributed RDF Repositories

Many repositories utilize the versatile RDF model to publish data. Repositories are typically distributed and geographically remote, but data are interconnected (e.g., the Semantic Web) and queried globally by a language such as SPARQL. Due to the network cost and the nature of the queries, the execution time can be prohibitively high. Current solutions attempt to minimize the network cost by r...

متن کامل

Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases

Governments, corporations, startups, open data initiatives and other organizations are increasingly considering RDF and SPARQL in a broad range of information management scenarios. To reduce SPARQL querying times has been the main issue for virtually all the recent RDF triplestores, yet SPARQL caching techniques have not been broadly considered. In this paper we present Rendezvous, a middleware...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1505.02728 شماره

صفحات -

تاریخ انتشار 2015

Adaptive Partitioning for Very Large RDF Data

نویسندگان

چکیده

منابع مشابه

DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication

Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning

Efficient SPARQL Query Evaluation via Automatic Data Partitioning

PHD-Store: An Adaptive SPARQL Engine with Dynamic Partitioning for Distributed RDF Repositories

Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases

عنوان ژورنال:

اشتراک گذاری